Abstract

We adapt the ideas underlying the success of Deep Q-Learning to the continuous
action domain. We present an actor-critic, model-free algorithm based on the
deterministic policy gradient that can operate over continuous action spaces. Using
the same learning algorithm, network architecture and hyper-parameters, our
algorithm robustly solves more than 20 simulated physics tasks, including classic
problems such as cartpole swing-up, dexterous manipulation, legged locomotion
and car driving. Our algorithm is able to find policies whose performance is
competitive with those found by a planning algorithm with full access to the dynamics
of the domain and its derivatives. We further demonstrate that for many of the
tasks the algorithm can learn policies “end-to-end”: directly from raw pixel
inputs.


1 Introduction 


One of the primary goals of the field of artificial intelligence is to solve complex tasks from
unprocessed, high-dimensional, sensory input. Recently, significant progress has been made by
combining advances in deep learning for sensory processing (Krizhevsky et al., 2012) with reinforcement
learning, resulting in the “Deep Q Network” (DQN) algorithm (Mnih et al., 2015) that is capable of
human level performance on many Atari video games using unprocessed pixels for input. To do so,
deep neural network function approximators were used to estimate the action-value function.


However, while DQN solves problems with high-dimensional observation spaces, it can only handle
discrete and low-dimensional action spaces. Many tasks of interest, most notably physical control
tasks, have continuous (real valued) and high dimensional action spaces. DQN cannot be
straightforwardly applied to continuous domains since it relies on finding the action that maximizes the
action-value function, which in the continuous valued case requires an iterative optimization process
at every step.

An obvious approach to adapting deep reinforcement learning methods such as DQN to continuous
domains is to simply discretize the action space. However, this has many limitations, most
notably the curse of dimensionality: the number of actions increases exponentially with the number
of degrees of freedom. For example, a 7 degree of freedom system (as in the human arm) with the
coarsest discretization $a_i \in \{-k, 0, k\}$ for each joint leads to a discrete action space with
$3^7 = 2187$ actions. The situation is even worse for tasks that require fine control of actions as they require
a correspondingly finer grained discretization, leading to an explosion of the number of discrete
actions. Such large action spaces are difficult to explore efficiently, and thus successfully training
DQN-like networks in this context is likely intractable. Additionally, naive discretization of action
spaces needlessly throws away information about the structure of the action domain, which may be
essential for solving many problems.
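
The exponential growth described above is easy to check numerically. The following is a minimal sketch; the function name and the finer 11-level grid are illustrative assumptions, not values from the paper.

```python
def num_discrete_actions(levels_per_joint: int, degrees_of_freedom: int) -> int:
    """Number of joint actions when every joint is discretized independently."""
    return levels_per_joint ** degrees_of_freedom


# Coarsest discretization from the text: 3 levels {-k, 0, k} for each of 7 joints.
coarse = num_discrete_actions(3, 7)

# A hypothetical finer grid (11 levels per joint) for the same 7-joint arm.
fine = num_discrete_actions(11, 7)

print(coarse, fine)
```

Even a modest refinement from 3 to 11 levels per joint moves the action count from thousands to tens of millions, which is the explosion the paragraph refers to.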


In this work we present a model-free, off-policy actor-critic algorithm using deep function
approximators that can learn policies in high-dimensional, continuous action spaces. Our work is based
on the deterministic policy gradient (DPG) algorithm (Silver et al., 2014) (itself similar to NFQCA
(Hafner & Riedmiller, 2011), and similar ideas can be found in (Prokhorov et al., 1997)). However,
as we show below, a naive application of this actor-critic method with neural function approximators
is unstable for challenging problems.


Here we combine the actor-critic approach with insights from the recent success of Deep Q Network
(DQN) (Mnih et al., 2013; 2015). Prior to DQN, it was generally believed that learning value
functions using large, non-linear function approximators was difficult and unstable. DQN is able
to learn value functions using such function approximators in a stable and robust way due to two
innovations: 1. the network is trained off-policy with samples from a replay buffer to minimize
correlations between samples; 2. the network is trained with a target Q network to give consistent
targets during temporal difference backups. In this work we make use of the same ideas, along with
batch normalization (Ioffe & Szegedy, 2015), a recent advance in deep learning.


In order to evaluate our method we constructed a variety of challenging physical control problems
that involve complex multi-joint movements, unstable and rich contact dynamics, and gait behavior.
Among these are classic problems such as the cartpole swing-up problem, as well as many new
domains. A long-standing challenge of robotic control is to learn an action policy directly from raw
sensory input such as video. Accordingly, we placed a fixed viewpoint camera in the simulator and
attempted all tasks both from low-dimensional observations (e.g. joint angles) and directly from
pixels.


Our model-free approach, which we call Deep DPG (DDPG), can learn competitive policies for all of
our tasks using low-dimensional observations (e.g. cartesian coordinates or joint angles) using the
same hyper-parameters and network structure. In many cases, we are also able to learn good policies
directly from pixels, again keeping hyperparameters and network structure constant.*


A key feature of the approach is its simplicity: it requires only a straightforward actor-critic
architecture and learning algorithm with very few “moving parts”, making it easy to implement and scale
to more difficult problems and larger networks. For the physical control problems we compare our
results to a baseline computed by a planner (Tassa et al., 2012) that has full access to the
underlying simulated dynamics and its derivatives (see supplementary information). Interestingly, DDPG
can sometimes find policies that exceed the performance of the planner, in some cases even when
learning from pixels (the planner always plans over the underlying low-dimensional state space).


2 Background 

We consider a standard reinforcement learning setup consisting of an agent interacting with an
environment $E$ in discrete timesteps. At each timestep $t$ the agent receives an observation $x_t$, takes
an action $a_t$ and receives a scalar reward $r_t$. In all the environments considered here the actions
are real-valued $a_t \in \mathbb{R}^N$. In general, the environment may be partially observed so that the entire
history of the observation, action pairs $s_t = (x_1, a_1, \ldots, a_{t-1}, x_t)$ may be required to describe the
state. Here, we assumed the environment is fully-observed so $s_t = x_t$.

An agent’s behavior is defined by a policy, $\pi$, which maps states to a probability distribution over
the actions $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$. The environment, $E$, may also be stochastic. We model it as a Markov
decision process with a state space $\mathcal{S}$, action space $\mathcal{A} = \mathbb{R}^N$, an initial state distribution $p(s_1)$,
transition dynamics $p(s_{t+1} | s_t, a_t)$, and reward function $r(s_t, a_t)$.

The return from a state is defined as the sum of discounted future reward $R_t = \sum_{i=t}^{T} \gamma^{(i-t)} r(s_i, a_i)$
with a discounting factor $\gamma \in [0, 1]$. Note that the return depends on the actions chosen, and therefore
on the policy $\pi$, and may be stochastic. The goal in reinforcement learning is to learn a policy which
maximizes the expected return from the start distribution $J = \mathbb{E}_{r_i, s_i \sim E, a_i \sim \pi}[R_1]$. We denote the
discounted state visitation distribution for a policy $\pi$ as $\rho^\pi$.
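
As a small worked example of the return defined above, the sketch below computes the discounted sum for a finite list of rewards. It assumes zero-based indexing (the paper indexes timesteps from 1), and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma, t=0):
    """Sum of gamma^(i - t) * r_i over i >= t, for a finite reward sequence."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)


# Illustrative rewards for a three-step episode.
rewards = [1.0, 0.0, 2.0]

# From the first step: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
ret_from_start = discounted_return(rewards, gamma=0.5)

# From the last step only the final reward counts.
ret_from_last = discounted_return(rewards, gamma=0.5, t=2)
```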

The action-value function is used in many reinforcement learning algorithms. It describes the
expected return after taking an action $a_t$ in state $s_t$ and thereafter following policy $\pi$:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_{i \ge t}, s_{i > t} \sim E, a_{i > t} \sim \pi}\left[ R_t \mid s_t, a_t \right] \qquad (1)$$


* You can view a movie of some of the learned policies at https://goo.gl/J4PIAz

Many approaches in reinforcement learning make use of the recursive relationship known as the
Bellman equation:

$$Q^\pi(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma \, \mathbb{E}_{a_{t+1} \sim \pi}\left[ Q^\pi(s_{t+1}, a_{t+1}) \right] \right] \qquad (2)$$


If the target policy is deterministic we can describe it as a function $\mu : \mathcal{S} \to \mathcal{A}$ and avoid the inner
expectation:

$$Q^\mu(s_t, a_t) = \mathbb{E}_{r_t, s_{t+1} \sim E}\left[ r(s_t, a_t) + \gamma Q^\mu(s_{t+1}, \mu(s_{t+1})) \right] \qquad (3)$$


The expectation depends only on the environment. This means that it is possible to learn $Q^\mu$
off-policy, using transitions which are generated from a different stochastic behavior policy $\beta$.


Q-learning (Watkins & Dayan, 1992), a commonly used off-policy algorithm, uses the greedy policy
$\mu(s) = \arg\max_a Q(s, a)$. We consider function approximators parameterized by $\theta^Q$, which we
optimize by minimizing the loss:


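$$L(\theta^Q) = \mathbb{E}_{s_t \sim \rho^\beta, a_t \sim \beta, r_t \sim E}\left[ \left( Q(s_t, a_t | \theta^Q) - y_t \right)^2 \right] \qquad (4)$$

where

$$y_t = r(s_t, a_t) + \gamma Q(s_{t+1}, \mu(s_{t+1}) | \theta^Q). \qquad (5)$$

While $y_t$ is also dependent on $\theta^Q$, this is typically ignored.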
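
To make equations (4) and (5) concrete, here is a minimal sketch that computes the targets and the mean squared loss over a small batch of transitions. The scalar critic `q`, policy `mu`, and the batch values are hypothetical and chosen only so the arithmetic can be checked by hand; the targets are treated as constants, as the text describes.

```python
def td_targets(batch, q_fn, mu_fn, gamma):
    """y = r + gamma * Q(s', mu(s')) for each (s, a, r, s') transition."""
    return [r + gamma * q_fn(s_next, mu_fn(s_next)) for (_, _, r, s_next) in batch]


def critic_loss(batch, q_fn, targets):
    """Mean of (Q(s, a) - y)^2 over the batch, with y held fixed."""
    errors = [(q_fn(s, a) - y) ** 2 for (s, a, _, _), y in zip(batch, targets)]
    return sum(errors) / len(errors)


# Hypothetical scalar critic and deterministic policy.
q = lambda s, a: 2.0 * s + a
mu = lambda s: -s

# Two made-up transitions (s, a, r, s').
batch = [(0.0, 1.0, 1.0, 2.0), (1.0, 0.0, 0.5, 0.0)]

ys = td_targets(batch, q, mu, gamma=0.9)
loss = critic_loss(batch, q, ys)
```

In a real implementation `q_fn` would be a parameterized network and the loss would be minimized with respect to its parameters by gradient descent.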


The use of large, non-linear function approximators for learning value or action-value functions has
often been avoided in the past since theoretical performance guarantees are impossible, and
practically learning tends to be unstable. Recently, (Mnih et al., 2013; 2015) adapted the Q-learning
algorithm in order to make effective use of large neural networks as function approximators. Their
algorithm was able to learn to play Atari games from pixels. In order to scale Q-learning they
introduced two major changes: the use of a replay buffer, and a separate target network for calculating
$y_t$. We employ these in the context of DDPG and explain their implementation in the next section.


3 Algorithm 


It is not possible to straightforwardly apply Q-learning to continuous action spaces, because in
continuous spaces finding the greedy policy requires an optimization of $a_t$ at every timestep; this
optimization is too slow to be practical with large, unconstrained function approximators and nontrivial
action spaces. Instead, here we used an actor-critic approach based on the DPG algorithm (Silver
et al., 2014).


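The DPG algorithm maintains a parameterized actor function $\mu(s | \theta^\mu)$ which specifies the current
policy by deterministically mapping states to a specific action. The critic $Q(s, a)$ is learned using
the Bellman equation as in Q-learning. The actor is updated by applying the chain rule
to the expected return from the start distribution $J$ with respect to the actor parameters:

$$\nabla_{\theta^\mu} J \approx \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_{\theta^\mu} Q(s, a | \theta^Q) |_{s = s_t, a = \mu(s_t | \theta^\mu)} \right]$$

$$= \mathbb{E}_{s_t \sim \rho^\beta}\left[ \nabla_a Q(s, a | \theta^Q) |_{s = s_t, a = \mu(s_t)} \nabla_{\theta^\mu} \mu(s | \theta^\mu) |_{s = s_t} \right] \qquad (6)$$

Silver et al. (2014) proved that this is the policy gradient, the gradient of the policy’s performance.^2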
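
The chain-rule actor update can be illustrated on a toy problem. This is a sketch under strong assumptions: the critic is a fixed, known function $Q(s, a) = -(a - 2s)^2$ rather than a learned network, the actor is the one-parameter linear map $\mu(s) = \theta s$, and the states and learning rate are invented. Under these assumptions the best policy is $\theta = 2$.

```python
def grad_a_q(s, a):
    """dQ/da for the hypothetical critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


def actor_step(theta, states, lr):
    """One gradient-ascent step for the linear actor mu(s) = theta * s.

    Chain rule: dJ/dtheta ~= mean over states of dQ/da (at a = mu(s)) * dmu/dtheta,
    where dmu/dtheta = s for this actor.
    """
    grads = [grad_a_q(s, theta * s) * s for s in states]
    return theta + lr * sum(grads) / len(grads)


theta = 0.0
states = [-1.0, -0.5, 0.5, 1.0]  # made-up sample of visited states
for _ in range(200):
    theta = actor_step(theta, states, lr=0.1)
```

The actor never evaluates a maximization over actions; it only follows the critic's gradient with respect to the action, which is what makes the approach practical in continuous action spaces.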

As with Q learning, introducing non-linear function approximators means that convergence is no
longer guaranteed. However, such approximators appear essential in order to learn and generalize
on large state spaces. NFQCA (Hafner & Riedmiller, 2011), which uses the same update rules as
DPG but with neural network function approximators, uses batch learning for stability, which is
intractable for large networks. A minibatch version of NFQCA which does not reset the policy at
each update, as would be required to scale to large networks, is equivalent to the original DPG,
which we compare to here. Our contribution here is to provide modifications to DPG, inspired by
the success of DQN, which allow it to use neural network function approximators to learn in large
state and action spaces online. We refer to our algorithm as Deep DPG (DDPG, Algorithm 1).

^2 In practice, as is commonly done in policy gradient implementations, we ignored the discount in the
state-visitation distribution $\rho^\beta$.

One challenge when using neural networks for reinforcement learning is that most optimization
algorithms assume that the samples are independently and identically distributed. Obviously, when
the samples are generated from exploring sequentially in an environment this assumption no longer
holds. Additionally, to make efficient use of hardware optimizations, it is essential to learn in
minibatches, rather than online.


As in DQN, we used a replay buffer to address these issues. The replay buffer is a finite sized cache
$\mathcal{R}$. Transitions were sampled from the environment according to the exploration policy and the tuple
$(s_t, a_t, r_t, s_{t+1})$ was stored in the replay buffer. When the replay buffer was full the oldest samples
were discarded. At each timestep the actor and critic are updated by sampling a minibatch uniformly
from the buffer. Because DDPG is an off-policy algorithm, the replay buffer can be large, allowing
the algorithm to benefit from learning across a set of uncorrelated transitions.
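
A replay buffer with the behaviour described above (finite capacity, oldest samples discarded, uniform minibatch sampling) can be sketched with the Python standard library. The class and method names are illustrative, and sampling without replacement within a minibatch is an assumption; the text only says the minibatch is sampled uniformly.

```python
import random
from collections import deque


class ReplayBuffer:
    """Finite-sized cache of transitions; the oldest entries are dropped when full."""

    def __init__(self, capacity):
        # A deque with maxlen discards items from the left once capacity is reached.
        self.storage = deque(maxlen=capacity)

    def add(self, s, a, r, s_next):
        self.storage.append((s, a, r, s_next))

    def sample(self, batch_size, rng=random):
        # Uniform sampling, without replacement within a single minibatch (assumption).
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.add(t, 0.0, 1.0, t + 1)  # made-up transitions

minibatch = buffer.sample(2, rng=random.Random(0))
```

After five insertions into a buffer of capacity three, only the three most recent transitions remain available for sampling.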


Directly implementing Q learning (equation 4) with neural networks proved to be unstable in many
environments. Since the network $Q(s, a | \theta^Q)$ being updated is also used in calculating the target
value (equation 5), the Q update is prone to divergence. Our solution is similar to the target network
used in (Mnih et al., 2013) but modified for actor-critic and using “soft” target updates, rather than
directly copying the weights. We create a copy of the actor and critic networks, $Q'(s, a | \theta^{Q'})$ and
$\mu'(s | \theta^{\mu'})$ respectively, that are used for calculating the target values. The weights of these target
networks are then updated by having them slowly track the learned networks:
$\theta' \leftarrow \tau \theta + (1 - \tau) \theta'$ with $\tau \ll 1$. This means that the target values are constrained to change slowly, greatly
improving the stability of learning. This simple change moves the relatively unstable problem of
learning the action-value function closer to the case of supervised learning, a problem for which
robust solutions exist. We found that having both a target $\mu'$ and $Q'$ was required to have stable
targets $y_i$ in order to consistently train the critic without divergence. This may slow learning, since
the target network delays the propagation of value estimations. However, in practice we found this
was greatly outweighed by the stability of learning.
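
The soft target update is a one-line interpolation per weight. The sketch below applies it to flat lists of floats; the list representation and function name are assumptions made for illustration, and the weight values are invented.

```python
def soft_update(target, source, tau):
    """In-place update: target <- tau * source + (1 - tau) * target, element-wise."""
    for i, (w_target, w_source) in enumerate(zip(target, source)):
        target[i] = tau * w_source + (1.0 - tau) * w_target
    return target


learned = [1.0, -2.0]        # made-up learned-network weights
target_weights = [0.0, 0.0]  # made-up target-network weights

soft_update(target_weights, learned, tau=0.001)
```

With a small tau the target weights move only a tiny fraction of the way toward the learned weights on each step, which is why the targets change slowly.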


When learning from low dimensional feature vector observations, the different components of the
observation may have different physical units (for example, positions versus velocities) and the
ranges may vary across environments. This can make it difficult for the network to learn
effectively and may make it difficult to find hyper-parameters which generalise across environments with
different scales of state values.


One approach to this problem is to manually scale the features so they are in similar ranges across
environments and units. We address this issue by adapting a recent technique from deep learning
called batch normalization (Ioffe & Szegedy, 2015). This technique normalizes each dimension
across the samples in a minibatch to have zero mean and unit variance. In addition, it maintains a
running average of the mean and variance to use for normalization during testing (in our case, during
exploration or evaluation). In deep networks, it is used to minimize covariate shift during training,
by ensuring that each layer receives whitened input. In the low-dimensional case, we used batch
normalization on the state input and all layers of the $\mu$ network and all layers of the $Q$ network prior
to the action input (details of the networks are given in the supplementary material). With batch
normalization, we were able to learn effectively across many different tasks with differing types of
units, without needing to manually ensure the units were within a set range.
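
The normalization step can be sketched without any deep learning library. This simplified version omits the learned scale and shift parameters of full batch normalization, and the `momentum` and `eps` values as well as the example observations are assumptions, not settings from the paper.

```python
import math


class BatchNorm1D:
    """Per-dimension normalization with running statistics (no learned scale/shift)."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.running_mean = [0.0] * dim
        self.running_var = [1.0] * dim
        self.momentum = momentum  # hypothetical value
        self.eps = eps            # hypothetical value

    def __call__(self, batch, training):
        dim = len(self.running_mean)
        if training:
            # Use minibatch statistics and fold them into the running averages.
            n = len(batch)
            mean = [sum(x[d] for x in batch) / n for d in range(dim)]
            var = [sum((x[d] - mean[d]) ** 2 for x in batch) / n for d in range(dim)]
            m = self.momentum
            for d in range(dim):
                self.running_mean[d] = (1 - m) * self.running_mean[d] + m * mean[d]
                self.running_var[d] = (1 - m) * self.running_var[d] + m * var[d]
        else:
            # Exploration or evaluation: use the running averages instead.
            mean, var = self.running_mean, self.running_var
        return [[(x[d] - mean[d]) / math.sqrt(var[d] + self.eps) for d in range(dim)]
                for x in batch]


bn = BatchNorm1D(dim=2)
# Made-up observations: a position-like and a velocity-like component on very different scales.
batch = [[0.1, 100.0], [0.3, 300.0], [0.2, 200.0]]
normalized = bn(batch, training=True)
```

After normalization both components lie on a comparable scale, regardless of their original units.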


A major challenge of learning in continuous action spaces is exploration. An advantage of
off-policy algorithms such as DDPG is that we can treat the problem of exploration independently
from the learning algorithm. We constructed an exploration policy $\mu'$ by adding noise sampled from
a noise process $\mathcal{N}$ to our actor policy

$$\mu'(s_t) = \mu(s_t | \theta_t^\mu) + \mathcal{N} \qquad (7)$$

$\mathcal{N}$ can be chosen to suit the environment. As detailed in the supplementary materials we
used an Ornstein-Uhlenbeck process (Uhlenbeck & Ornstein, 1930) to generate temporally
correlated exploration for exploration efficiency in physical control problems with inertia (similar use of
autocorrelated noise was introduced in (Wawrzyński, 2015)).
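
A discretized Ornstein-Uhlenbeck process is one way to realize the temporally correlated noise described above. In this sketch the values theta = 0.15 and sigma = 0.2 follow the supplementary section of the paper, while the unit time step, the zero long-run mean, the Euler-Maruyama discretization, and all names are assumptions.

```python
import random


class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise that drifts back toward a long-run mean mu."""

    def __init__(self, dim, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self.rng = random.Random(seed)
        self.state = [mu] * dim

    def reset(self):
        """Restart the process, e.g. at the beginning of an episode."""
        self.state = [self.mu] * len(self.state)

    def sample(self):
        # Euler-Maruyama step: x += theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + self.sigma * (self.dt ** 0.5) * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)


def explore(actor_action, noise):
    """Exploration action: the actor's output plus one noise sample per dimension."""
    return [a + n for a, n in zip(actor_action, noise.sample())]


noise = OrnsteinUhlenbeckNoise(dim=2, seed=0)
noisy_action = explore([0.0, 0.5], noise)
```

Because each sample depends on the previous state, consecutive perturbations tend to push in the same direction, which suits systems with inertia better than independent noise.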


4 Results 


We constructed simulated physical environments of varying levels of difficulty to test our algorithm.
This included classic reinforcement learning environments such as cartpole, as well as difficult,
high dimensional tasks such as gripper, tasks involving contacts such as puck striking (canada)
and locomotion tasks such as cheetah (Wawrzyński, 2009). In all domains but cheetah the actions
were torques applied to the actuated joints. These environments were simulated using MuJoCo
(Todorov et al., 2012). Figure 1 shows renderings of some of the environments used in the task (the
supplementary contains details of the environments and you can view some of the learned policies
at https://goo.gl/J4PIAz).


Algorithm 1 DDPG algorithm

Randomly initialize critic network $Q(s, a | \theta^Q)$ and actor $\mu(s | \theta^\mu)$ with weights $\theta^Q$ and $\theta^\mu$.
Initialize target network $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for episode = 1, M do
    Initialize a random process $\mathcal{N}$ for action exploration
    Receive initial observation state $s_1$
    for t = 1, T do
        Select action $a_t = \mu(s_t | \theta^\mu) + \mathcal{N}_t$ according to the current policy and exploration noise
        Execute action $a_t$ and observe reward $r_t$ and observe new state $s_{t+1}$
        Store transition $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a random minibatch of $N$ transitions $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} | \theta^{\mu'}) | \theta^{Q'})$
        Update critic by minimizing the loss: $L = \frac{1}{N} \sum_i (y_i - Q(s_i, a_i | \theta^Q))^2$
        Update the actor policy using the sampled policy gradient:
            $$\nabla_{\theta^\mu} J \approx \frac{1}{N} \sum_i \nabla_a Q(s, a | \theta^Q) |_{s = s_i, a = \mu(s_i)} \nabla_{\theta^\mu} \mu(s | \theta^\mu) |_{s_i}$$
        Update the target networks:
            $$\theta^{Q'} \leftarrow \tau \theta^Q + (1 - \tau) \theta^{Q'}$$
            $$\theta^{\mu'} \leftarrow \tau \theta^\mu + (1 - \tau) \theta^{\mu'}$$
    end for
end for


In all tasks, we ran experiments using both a low-dimensional state description (such as joint angles
and positions) and high-dimensional renderings of the environment. As in DQN (Mnih et al., 2013;
2015), in order to make the problems approximately fully observable in the high dimensional
environment we used action repeats. For each timestep of the agent, we step the simulation 3 timesteps,
repeating the agent’s action and rendering each time. Thus the observation reported to the agent
contains 9 feature maps (the RGB of each of the 3 renderings) which allows the agent to infer
velocities using the differences between frames. The frames were downsampled to 64x64 pixels and the
8-bit RGB values were converted to floating point scaled to [0,1]. See supplementary information
for details of our network structure and hyperparameters.
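
The observation construction described above (repeat the action three times, render after each step, rescale to [0, 1], and stack the RGB channels) can be sketched as follows. The callables `step_fn` and `render_fn` stand in for a simulator interface and are hypothetical; frames are represented as nested lists of shape height x width x 3, and downsampling to 64x64 is omitted.

```python
def scale_pixels(frame):
    """Convert 8-bit channel values to floats in [0, 1]."""
    return [[[v / 255.0 for v in pixel] for pixel in row] for row in frame]


def stack_frames(frames):
    """Concatenate the channels of k frames (H x W x 3 each) into H x W x (3 * k)."""
    height, width = len(frames[0]), len(frames[0][0])
    return [[[c for f in frames for c in f[y][x]] for x in range(width)]
            for y in range(height)]


def repeated_observation(step_fn, render_fn, action, repeats=3):
    """Apply one agent action for `repeats` simulator steps, rendering after each."""
    frames = []
    for _ in range(repeats):
        step_fn(action)                           # hypothetical simulator step
        frames.append(scale_pixels(render_fn()))  # hypothetical renderer
    return stack_frames(frames)
```

With three RGB renderings per agent step the stacked observation has nine channels, and differences between the stacked frames carry the velocity information that a single frame lacks.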


We evaluated the policy periodically during training by testing it without exploration noise. Figure
2 shows the performance curve for a selection of environments. We also report results with
components of our algorithm (i.e. the target network or batch normalization) removed. In order to perform
well across all tasks, both of these additions are necessary. In particular, learning without a target
network, as in the original work with DPG, is very poor in many environments.


Surprisingly, in some simpler tasks, learning policies from pixels is just as fast as learning using the 
low-dimensional state descriptor. This may be due to the action repeats making the problem simpler. 
It may also be that the convolutional layers provide an easily separable representation of state space, 
which is straightforward for the higher layers to learn on quickly. 

Table 1 summarizes DDPG’s performance across all of the environments (results are averaged over
5 replicas). We normalized the scores using two baselines. The first baseline is the mean return
from a naive policy which samples actions from a uniform distribution over the valid action space.
The second baseline is iLQG (Todorov & Li, 2005), a planning based solver with full access to the
underlying physical model and its derivatives. We normalize scores so that the naive policy has a
mean score of 0 and iLQG has a mean score of 1. DDPG is able to learn good policies on many of
the tasks, and in many cases some of the replicas learn policies which are superior to those found by
iLQG, even when learning directly from pixels.
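
The two-baseline normalization amounts to an affine rescaling of the raw return. The sketch below assumes that form; the function name and the example returns are made up and are not results from the paper.

```python
def normalize_score(raw_return, naive_return, planner_return):
    """Rescale so the naive policy maps to 0 and the planner (iLQG) maps to 1."""
    return (raw_return - naive_return) / (planner_return - naive_return)


# Made-up returns for a single environment.
naive, planner = -50.0, 150.0
score = normalize_score(200.0, naive, planner)
```

A normalized score above 1 therefore means the learned policy obtained a higher return than the planner on that environment.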


It can be challenging to learn accurate value estimates. Q-learning, for example, is prone to
overestimating values (Hasselt, 2010). We examined DDPG’s estimates empirically by comparing the
values estimated by $Q$ after training with the true returns seen on test episodes. Figure 3 shows that
in simple tasks DDPG estimates returns accurately without systematic biases. For harder tasks the
Q estimates are worse, but DDPG is still able to learn good policies.


To demonstrate the generality of our approach we also include Torcs, a racing game where the
actions are acceleration, braking and steering. Torcs has previously been used as a testbed in other
policy learning approaches (Koutník et al., 2014b). We used an identical network architecture and
learning algorithm hyper-parameters to the physics tasks but altered the noise process for exploration
because of the very different time scales involved. With both low-dimensional and pixel inputs, some
replicas were able to learn reasonable policies that are able to complete a circuit around the track
though other replicas failed to learn a sensible policy.






Figure 1: Example screenshots of a sample of environments we attempt to solve with DDPG. In
order from the left: the cartpole swing-up task, a reaching task, a grasp and move task, a puck-hitting
task, a monoped balancing task, two locomotion tasks and Torcs (driving simulator). We tackle
all tasks using both low-dimensional feature vector and high-dimensional pixel inputs. Detailed
descriptions of the environments are provided in the supplementary. Movies of some of the learned
policies are available at https://goo.gl/J4PIAz

Figure 2: Performance curves for a selection of domains using variants of DPG: original DPG
algorithm (minibatch NFQCA) with batch normalization (light grey), with target network (dark
grey), with target networks and batch normalization (green), with target networks from pixel-only
inputs (blue). Target networks are crucial.


5 Related work 

The original DPG paper evaluated the algorithm with toy problems using tile-coding and linear 
function approximators. It demonstrated data efficiency advantages for off-policy DPG over both 
on- and off-policy stochastic actor critic. It also solved one more challenging task in which a
multi-jointed octopus arm had to strike a target with any part of the limb. However, that paper did not
demonstrate scaling the approach to large, high-dimensional observation spaces as we have here. 

It has often been assumed that standard policy search methods such as those explored in the present
work are simply too fragile to scale to difficult problems (Levine et al., 2015). Standard policy search
is thought to be difficult because it deals simultaneously with complex environmental dynamics and
a complex policy. Indeed, most past work with actor-critic and policy optimization approaches has
had difficulty scaling up to more challenging problems (Deisenroth et al., 2013). Typically, this
is due to instability in learning wherein progress on a problem is either destroyed by subsequent
learning updates, or else learning is too slow to be practical.

Figure 3: Density plot showing estimated Q values versus observed returns sampled from test
episodes on 5 replicas. In simple domains such as pendulum and cartpole the Q values are quite
accurate. In more complex tasks, the Q estimates are less accurate, but can still be used to learn
competent policies. Dotted line indicates unity, units are arbitrary.


Table 1: Performance after training across all environments for at most 2.5 million steps. We report
both the average and best observed (across 5 runs). All scores, except Torcs, are normalized so
that a random agent receives 0 and a planning algorithm 1; for Torcs we present the raw reward
score. We include results from the DDPG algorithm in the low-dimensional (lowd) version of the
environment and high-dimensional (pix). For comparison we also include results from the original
DPG algorithm with a replay buffer and batch normalization (cntrl).

environment               R_av,lowd  R_best,lowd  R_av,pix  R_best,pix  R_av,cntrl  R_best,cntrl
blockworld1                   1.156        1.511     0.466       1.299      -0.080         1.260
blockworld3da                 0.340        0.705     0.889       2.225      -0.139         0.658
canada                        0.303        1.735     0.176       0.688       0.125         1.157
canada2d                      0.400        0.978    -0.285       0.119      -0.045         0.701
cart                          0.938        1.336     1.096       1.258       0.343         1.216
cartpole                      0.844        1.115     0.482       1.138       0.244         0.755
cartpoleBalance               0.951        1.000     0.335       0.996      -0.468         0.528
cartpoleParallelDouble        0.549        0.900     0.188       0.323       0.197         0.572
cartpoleSerialDouble          0.272        0.719     0.195       0.642       0.143         0.701
cartpoleSerialTriple          0.736        0.946     0.412       0.427       0.583         0.942
cheetah                       0.903        1.206     0.457       0.792      -0.008         0.425
fixedReacher                  0.849        1.021     0.693       0.981       0.259         0.927
fixedReacherDouble            0.924        0.996     0.872       0.943       0.290         0.995
fixedReacherSingle            0.954        1.000     0.827       0.995       0.620         0.999
gripper                       0.655        0.972     0.406       0.790       0.461         0.816
gripperRandom                 0.618        0.937     0.082       0.791       0.557         0.808
hardCheetah                   1.311        1.990     1.204       1.431      -0.031         1.411
hopper                        0.676        0.936     0.112       0.924       0.078         0.917
hyq                           0.416        0.722     0.234       0.672       0.198         0.618
movingGripper                 0.474        0.936     0.480       0.644       0.416         0.805
pendulum                      0.946        1.021     0.663       1.055       0.099         0.951
reacher                       0.720        0.987     0.194       0.878       0.231         0.953
reacher3daFixedTarget         0.585        0.943     0.453       0.922       0.204         0.631
reacher3daRandomTarget        0.467        0.739     0.374       0.735      -0.046         0.158
reacherSingle                 0.981        1.102     1.000       1.083       1.010         1.083
walker2d                      0.705        1.573     0.944       1.476       0.393         1.397
torcs                      -393.385     1840.036  -401.911    1876.284    -911.034      1961.600

Recent work with model-free policy search has demonstrated that it may not be as fragile as
previously supposed. Wawrzyński (2009) and Wawrzyński & Tanwani (2013) have trained stochastic policies
in an actor-critic framework with a replay buffer. Concurrent with our work, Balduzzi & Ghifary
(2015) extended the DPG algorithm with a “deviator” network which explicitly learns $\partial Q / \partial a$.
However, they only train on two low-dimensional domains. Heess et al. (2015) introduced SVG(0) which
also uses a Q-critic but learns a stochastic policy. DPG can be considered the deterministic limit of
SVG(0). The techniques we described here for scaling DPG are also applicable to stochastic policies
by using the reparametrization trick (Heess et al., 2015; Schulman et al., 2015a).

Another approach, trust region policy optimization (TRPO) (Schulman et al., 2015b), directly
constructs stochastic neural network policies without decomposing problems into optimal control and
supervised phases. This method produces near monotonic improvements in return by making
carefully chosen updates to the policy parameters, constraining updates to prevent the new policy from
diverging too far from the existing policy. This approach does not require learning an action-value
function, and (perhaps as a result) appears to be significantly less data efficient.

To combat the challenges of the actor-critic approach, recent work with guided policy search (GPS)
algorithms (e.g., (Levine et al., 2015)) decomposes the problem into three phases that are
relatively easy to solve: first, it uses full-state observations to create locally-linear approximations of
the dynamics around one or more nominal trajectories, and then uses optimal control to find the
locally-linear optimal policy along these trajectories; finally, it uses supervised learning to train a
complex, non-linear policy (e.g. a deep neural network) to reproduce the state-to-action mapping of
the optimized trajectories.

This approach has several benefits, including data efficiency, and has been applied successfully to
a variety of real-world robotic manipulation tasks using vision. In these tasks GPS uses a similar
convolutional policy network to ours with 2 notable exceptions: 1. it uses a spatial softmax to reduce
the dimensionality of visual features into a single $(x, y)$ coordinate for each feature map, and 2. the
policy also receives direct low-dimensional state information about the configuration of the robot at
the first fully connected layer in the network. Both likely increase the power and data efficiency of
the algorithm and could easily be exploited within the DDPG framework.


PILCO (Deisenroth & Rasmussen, 2011) uses Gaussian processes to learn a non-parametric,
probabilistic model of the dynamics. Using this learned model, PILCO calculates analytic policy gradients
and achieves impressive data efficiency in a number of control problems. However, due to the high
computational demand, PILCO is “impractical for high-dimensional problems” (Wahlström et al.,
2015). It seems that deep function approximators are the most promising approach for scaling
reinforcement learning to large, high-dimensional domains.


Wahlström et al. (2015) used a deep dynamical model network along with model predictive control
to solve the pendulum swing-up task from pixel input. They trained a differentiable forward model
and encoded the goal state into the learned latent space. They use model-predictive control over the
learned model to find a policy for reaching the target. However, this approach is only applicable to
domains with goal states that can be demonstrated to the algorithm.

Recently, evolutionary approaches have been used to learn competitive policies for Torcs from pixels
using compressed weight parametrizations (Koutník et al., 2014a) or unsupervised learning (Koutník
et al., 2014b) to reduce the dimensionality of the evolved weights. It is unclear how well these
approaches generalize to other problems.


6 Conclusion 


The work combines insights from recent advances in deep learning and reinforcement learning,
resulting in an algorithm that robustly solves challenging problems across a variety of domains with
continuous action spaces, even when using raw pixels for observations. As with most reinforcement
learning algorithms, the use of non-linear function approximators nullifies any convergence
guarantees; however, our experimental results demonstrate stable learning without the need for any
modifications between environments.

Interestingly, all of our experiments used substantially fewer steps of experience than was used by
DQN learning to find solutions in the Atari domain. Nearly all of the problems we looked at were
solved within 2.5 million steps of experience (and usually far fewer), a factor of 20 fewer steps than
DQN requires for good Atari solutions. This suggests that, given more simulation time, DDPG may
solve even more difficult problems than those considered here.


A few limitations to our approach remain. Most notably, as with most model-free reinforcement
learning approaches, DDPG requires a large number of training episodes to find solutions. However, we
believe that a robust model-free approach may be an important component of larger systems which
may attack these limitations (Gläscher et al., 2010).

Supplementary Information: Continuous control with
deep reinforcement learning


7 Experiment Details 


We used Adam (Kingma & Ba, 2014) for learning the neural network parameters with a learning
rate of $10^{-4}$ and $10^{-3}$ for the actor and critic respectively. For Q we included $L_2$ weight
decay of $10^{-2}$ and used a discount factor of $\gamma = 0.99$. For the soft target updates we used
$\tau = 0.001$. The neural networks used the rectified non-linearity (Glorot et al., 2011) for all
hidden layers. The final output layer of the actor was a tanh layer, to bound the actions. The
low-dimensional networks had 2 hidden layers with 400 and 300 units respectively (≈ 130,000
parameters). Actions were not included until the 2nd hidden layer of Q. When learning from pixels
we used 3 convolutional layers (no pooling) with 32 filters at each layer. This was followed by two
fully connected layers with 200 units (≈ 430,000 parameters). The final layer weights and biases of
both the actor and critic were initialized from a uniform distribution
$[-3 \times 10^{-3}, 3 \times 10^{-3}]$ and $[-3 \times 10^{-4}, 3 \times 10^{-4}]$ for the
low dimensional and pixel cases respectively. This was to ensure the initial outputs for the policy
and value estimates were near zero. The other layers were initialized from uniform distributions
$[-1/\sqrt{f}, 1/\sqrt{f}]$ where $f$ is the fan-in of the layer. In the pixel case, the actions were
not included until the fully-connected layers. We trained with minibatch sizes of 64 for the low
dimensional problems and 16 on pixels. We used a replay buffer size of $10^{6}$.
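
The layer sizes, initialization ranges and soft target update described above can be checked with a short sketch. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, weights are plain Python lists rather than tensors, and the critic count assumes that "not included until the 2nd hidden layer" means the action is concatenated to the first hidden layer's output.

```python
import math
import random

HIDDEN = (400, 300)   # hidden layer widths of the low-dimensional networks
TAU = 0.001           # soft target update rate
FINAL_INIT = 3e-3     # final-layer range in the low-dimensional case


def actor_param_count(obs_dim, act_dim):
    """Weights plus biases of an obs -> 400 -> 300 -> act network."""
    h1, h2 = HIDDEN
    return (obs_dim * h1 + h1) + (h1 * h2 + h2) + (h2 * act_dim + act_dim)


def critic_param_count(obs_dim, act_dim):
    """Same layout, but the action joins the input of the 2nd hidden layer
    (an assumed reading of the text) and the output is a single Q-value."""
    h1, h2 = HIDDEN
    return (obs_dim * h1 + h1) + ((h1 + act_dim) * h2 + h2) + (h2 + 1)


def fan_in_uniform(fan_in, fan_out, rng):
    """Hidden-layer init: U[-1/sqrt(f), 1/sqrt(f)] with f the fan-in."""
    bound = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def final_layer_uniform(fan_in, fan_out, rng, bound=FINAL_INIT):
    """Final-layer init: a small uniform range keeps initial outputs near zero."""
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


def soft_update(target, online, tau=TAU):
    """theta' <- tau * theta + (1 - tau) * theta', element-wise on flat lists."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]


if __name__ == "__main__":
    # cheetah: 17 observation and 6 action dimensions (Table 2).
    print(actor_param_count(17, 6), critic_param_count(17, 6))
```

For the cheetah dimensions this gives 129,306 actor and 129,601 critic parameters, which is consistent with the quoted figure of roughly 130,000.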

For the exploration noise process we used temporally correlated noise in order to explore well in
physical environments that have momentum. We used an Ornstein-Uhlenbeck process (Uhlenbeck
& Ornstein, 1930) with $\theta = 0.15$ and $\sigma = 0.2$. The Ornstein-Uhlenbeck process models the
velocity of a Brownian particle with friction, which results in temporally correlated values
centered around 0.
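
A minimal sketch of such a noise source follows, using a simple Euler discretization of the Ornstein-Uhlenbeck process. The class name, the step size `dt = 1.0`, the zero mean `mu` and the reset-at-mean behaviour are assumptions made for illustration; the text specifies only the two coefficients, 0.15 (mean reversion) and 0.2 (noise scale).

```python
import math
import random


class OrnsteinUhlenbeckNoise:
    """Discretized OU process:
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, size, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.size = size
        self.theta = theta    # mean-reversion rate (0.15 in the text)
        self.sigma = sigma    # noise scale (0.2 in the text)
        self.mu = mu          # assumed long-run mean ("centered around 0")
        self.dt = dt          # assumed step size; not given in the text
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        """Restart the process at its mean, e.g. at the start of an episode."""
        self.state = [self.mu] * self.size

    def sample(self):
        """Advance one step and return a copy of the new noise vector."""
        scale = self.sigma * math.sqrt(self.dt)
        self.state = [
            x + self.theta * (self.mu - x) * self.dt
            + scale * self.rng.gauss(0.0, 1.0)
            for x in self.state
        ]
        return list(self.state)
```

In use, one sample per time step would be added to the actor's output before the action is sent to the environment. Because each value is pulled only part of the way back towards the mean, consecutive samples stay correlated, which is the property the text relies on for exploration in systems with momentum.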


8 Planning algorithm 


Our planner is implemented as a model-predictive controller (Tassa et al., 2012): at every time step
we run a single iteration of trajectory optimization (using iLQG; Todorov & Li, 2005), starting
from the true state of the system. Every single trajectory optimization is planned over a horizon 
between 250ms and 600ms, and this planning horizon recedes as the simulation of the world unfolds, 
as is the case in model-predictive control. 


The iLQG iteration begins with an initial rollout of the previous policy, which determines the
nominal trajectory. We use repeated samples of simulated dynamics to approximate a linear expansion
of the dynamics around every step of the trajectory, as well as a quadratic expansion of the cost 
function. We use this sequence of locally-linear-quadratic models to integrate the value function 
backwards in time along the nominal trajectory. This back-pass results in a putative modification to 
the action sequence that will decrease the total cost. We perform a derivative-free line-search over 
this direction in the space of action sequences by integrating the dynamics forward (the
forward-pass), and choose the best trajectory. We store this action sequence in order to warm-start the next
iLQG iteration, and execute the first action in the simulator. This results in a new state, which is 
used as the initial state in the next iteration of trajectory optimization. 
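
The receding-horizon structure of this loop (one improvement iteration per time step, a derivative-free line search evaluated by forward rollouts, execution of the first action only, and a warm start from the shifted plan) can be sketched on a toy problem. The sketch below is not iLQG: the search direction comes from a finite-difference gradient that stands in for the backward pass, and the 1-D point mass, the cost weights, the step size `DT` and all function names are hypothetical.

```python
DT = 0.05        # assumed simulator step in seconds
HORIZON = 10     # 10 steps * 0.05 s = 500 ms, inside the 250-600 ms range


def step(state, u):
    """Toy 1-D point mass (position, velocity); stands in for the simulator."""
    x, v = state
    v = v + u * DT
    return (x + v * DT, v)


def rollout_cost(state, actions):
    """Total cost of applying `actions` from `state` (a forward pass)."""
    total = 0.0
    for u in actions:
        state = step(state, u)
        total += state[0] ** 2 + 0.1 * state[1] ** 2 + 0.01 * u ** 2
    return total


def improve(state, actions, eps=1e-4):
    """One improvement iteration: a search direction, then a line search.

    The direction is a central finite-difference gradient of the rollout
    cost, used here in place of the iLQG backward pass.
    """
    grad = []
    for i in range(len(actions)):
        up = actions[:i] + [actions[i] + eps] + actions[i + 1:]
        down = actions[:i] + [actions[i] - eps] + actions[i + 1:]
        grad.append((rollout_cost(state, up) - rollout_cost(state, down)) / (2 * eps))
    # Derivative-free line search: score each candidate step size by rollout
    # and keep the cheapest. alpha = 0 keeps the warm start if nothing helps.
    candidates = [0.0] + [10.0 ** k for k in range(-3, 3)]
    trials = [[u - a * g for u, g in zip(actions, grad)] for a in candidates]
    return min(trials, key=lambda seq: rollout_cost(state, seq))


def run_mpc(state, n_steps):
    """Receding-horizon loop with one warm-started iteration per time step."""
    plan = [0.0] * HORIZON
    visited = [state]
    for _ in range(n_steps):
        plan = improve(state, plan)
        state = step(state, plan[0])   # execute only the first action
        plan = plan[1:] + [0.0]        # shift the plan to warm-start the next step
        visited.append(state)
    return visited
```

Calling `run_mpc((1.0, 0.0), 50)` returns the visited states. Because the zero step size is always among the candidates, a single call to `improve` never returns a plan that costs more than the one it was given.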


9 Environment details 

9.1 TORCS ENVIRONMENT 

For the Torcs environment we used a reward function which provides a positive reward at each step
for the velocity of the car projected along the track direction and a penalty of -1 for collisions.
Episodes were terminated if progress was not made along the track after 500 frames. 
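
As a rough illustration, the reward and termination rule might look as follows. The function and class names, the inputs (`speed`, `angle_to_track`, `distance_along_track`) and the choice to add the collision penalty to the velocity term are assumptions; the text does not say how the projection or the progress check was computed.

```python
import math


def torcs_reward(speed, angle_to_track, collided):
    """Speed projected onto the track direction, minus 1 on a collision step.

    `angle_to_track` is the angle in radians between the car's heading and
    the track axis (a hypothetical input).
    """
    reward = speed * math.cos(angle_to_track)
    if collided:
        reward -= 1.0
    return reward


class ProgressMonitor:
    """Signals termination after `patience` frames without progress."""

    def __init__(self, patience=500):
        self.patience = patience
        self.best_distance = None
        self.frames_since_progress = 0

    def update(self, distance_along_track):
        """Record one frame; return True when the episode should end."""
        if self.best_distance is None or distance_along_track > self.best_distance:
            self.best_distance = distance_along_track
            self.frames_since_progress = 0
        else:
            self.frames_since_progress += 1
        return self.frames_since_progress >= self.patience
```

Under this reading a car driving straight along the track at speed 10 earns 10 per step, and 9 on a step with a collision, while a car moving perpendicular to the track earns nothing.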

9.2 MUJOCO ENVIRONMENTS


For physical control tasks we used reward functions which provide feedback at every step. In all 
tasks, the reward contained a small action cost. For all tasks that have a static goal state (e.g. 
pendulum swingup and reaching) we provide a smoothly varying reward based on distance to a goal 
state, and in some cases an additional positive reward when within a small radius of the target state. 
For grasping and manipulation tasks we used a reward with a term which encourages movement 
towards the payload and a second component which encourages moving the payload to the target. In 
locomotion tasks we reward forward action and penalize hard impacts to encourage smooth rather
than hopping gaits (Schulman et al., 2015b). In addition, we used a negative reward and early
termination for falls which were determined by simple thresholds on the height and torso angle (in
the case of walker2d).
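
The reward structure for a static-goal task, and the threshold test for falls, can be sketched as below. Every coefficient and threshold here is an illustrative placeholder, since the text does not report the values used, and the function names are hypothetical.

```python
import math


def reaching_reward(position, goal, action,
                    action_cost=1e-3, bonus=1.0, bonus_radius=0.05):
    """Smooth distance-based reward with an action cost and a near-goal bonus.

    All default coefficients are placeholders, not values from the experiments.
    """
    distance = math.dist(position, goal)
    reward = -distance                                   # smoothly varying term
    reward -= action_cost * sum(a * a for a in action)   # small action cost
    if distance < bonus_radius:
        reward += bonus                                  # extra reward near the target
    return reward


def has_fallen(height, torso_angle, min_height=0.8, max_angle=1.0):
    """Threshold test for a fall; both thresholds are hypothetical."""
    return height < min_height or abs(torso_angle) > max_angle
```

A locomotion environment would call something like `has_fallen` after each step and, when it returns true, emit a negative reward and end the episode.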


Table 2 states the dimensionality of the problems and below is a summary of all the physics
environments.


task name                 dim(s)  dim(a)  dim(o)
blockworld1                   18       5      43
blockworld3da                 31       9     102
canada                        22       7      62
canada2d                      14       3      29
cart                           2       1       3
cartpole                       4       1      14
cartpoleBalance                4       1      14
cartpoleParallelDouble         6       1      16
cartpoleParallelTriple         8       1      23
cartpoleSerialDouble           6       1      14
cartpoleSerialTriple           8       1      23
cheetah                       18       6      17
fixedReacher                  10       3      23
fixedReacherDouble             8       2      18
fixedReacherSingle             6       1      13
gripper                       18       5      43
gripperRandom                 18       5      43
hardCheetah                   18       6      17
hardCheetahNice               18       6      17
hopper                        14       4      14
hyq                           37      12      37
hyqKick                       37      12      37
movingGripper                 22       7      49
movingGripperRandom           22       7      49
pendulum                       2       1       3
reacher                       10       3      23
reacher3daFixedTarget         20       7      61
reacher3daRandomTarget        20       7      61
reacherDouble                  6       1      13
reacherObstacle               18       5      38
reacherSingle                  6       1      13
walker2d                      18       6      41


Table 2: Dimensionality of the MuJoCo tasks: the dimensionality of the underlying physics model 
dim(s), number of action dimensions dim(a) and observation dimensions dim(o). 


task name 

Brief Description 

blockworld1

Agent is required to use an arm with gripper constrained to the 2D plane 
to grab a falling block and lift it against gravity to a fixed target position. 

blockworld3da

Agent is required to use a human-like arm with 7-DOF and a simple
gripper to grab a block and lift it against gravity to a fixed target position.

canada

Agent is required to use a 7-DOF arm with hockey-stick like appendage 
to hit a ball to a target. 

canada2d 

Agent is required to use an arm with hockey-stick like appendage to hit 
a ball initialized at a random start location to a random target location.

cart 

Agent must move a simple mass to rest at 0. The mass begins each trial 
in random positions and with random velocities. 

cartpole

The classic cart-pole swing-up task. Agent must balance a pole attached
to a cart by applying forces to the cart alone. The pole starts
each episode hanging upside-down.

cartpoleBalance 

The classic cart-pole balance task. Agent must balance a pole attached 
to a cart by applying forces to the cart alone. The pole starts in the 
upright position at the beginning of each episode.

cartpoleParallelDouble 

Variant on the classic cart-pole. Two poles, both attached to the cart, 
should be kept upright as much as possible. 

cartpoleSerialDouble 

Variant on the classic cart-pole. Two poles, one attached to the cart and 
the second attached to the end of the first, should be kept upright as 
much as possible. 

cartpoleSerialTriple

Variant on the classic cart-pole. Three poles, one attached to the cart, 
the second attached to the end of the first, and the third attached to the 
end of the second, should be kept upright as much as possible. 

cheetah 

The agent should move forward as quickly as possible with a cheetah-like
body that is constrained to the plane. This environment is based
very closely on the one introduced by Wawrzyński (2009); Wawrzyński
& Tanwani (2013).

fixedReacher 

Agent is required to move a 3-DOF arm to a fixed target position. 

fixedReacherDouble 

Agent is required to move a 2-DOF arm to a fixed target position. 

fixedReacherSingle 

Agent is required to move a simple 1-DOF arm to a fixed target position. 

gripper 

Agent must use an arm with gripper appendage to grasp an object and
maneuver the object to a fixed target.

gripperRandom 

The same task as gripper except that the arm, object and target position
are initialized in random locations.

hardCheetah 

The agent should move forward as quickly as possible with a cheetah-like
body that is constrained to the plane. This environment is based
very closely on the one introduced by Wawrzyński (2009); Wawrzyński
& Tanwani (2013), but has been made much more difficult by removing
the stabilizing joint stiffness from the model.

hopper 

Agent must balance a multiple degree of freedom monoped to keep it 
from falling. 

hyq 

Agent is required to keep a quadruped model based on the hyq robot
from falling. 

movingGripper

Agent must use an arm with gripper attached to a moveable platform to
grasp an object and move it to a fixed target.

movingGripperRandom

The same as the movingGripper environment except that the object position,
target position, and arm state are initialized randomly.

pendulum

The classic pendulum swing-up problem. The pendulum should be
brought to the upright position and balanced. Torque limits prevent the
agent from swinging the pendulum up directly.

reacher3daFixedTarget

Agent is required to move a 7-DOF human-like arm to a fixed target
position.

reacher3daRandomTarget

Agent is required to move a 7-DOF human-like arm from random starting
locations to random target positions.

reacher

Agent is required to move a 3-DOF arm from random starting locations
to random target positions.

reacherSingle

Agent is required to move a simple 1-DOF arm from random starting
locations to random target positions.

reacherObstacle

Agent is required to move a 5-DOF arm around an obstacle to a randomized
target position.

walker2d

Agent should move forward as quickly as possible with a bipedal walker
constrained to the plane without falling down or pitching the torso too
far forward or backward.